MapReduce Algorithm for Word Counting on the Fraudulent E-Mail Corpus

Project Overview

This project applies the MapReduce programming model to the Fraudulent E-Mail Corpus, a dataset of over 2,500 phishing emails. The goal is to identify the 20 most frequently used words and to assess how those word frequencies reflect the phishing nature of the dataset.

Key Features

Implemented a custom MapReduce algorithm for text data processing.
Calculated word frequencies without relying on dictionary-based structures.
Extracted and analyzed the 20 most frequent words from the dataset.
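
The features above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (map_phase, shuffle_phase, reduce_phase, top_words), the tokenization rule, and the sample emails are all assumptions made for illustration. To avoid dictionary-based structures, the sketch groups identical words by sorting the (word, 1) pairs and summing each run of equal keys.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumed tokenizer: runs of lowercase ASCII letters. The project's real
# tokenization rules are not stated, so this is a placeholder.
TOKEN_RE = re.compile(r"[a-z]+")


def map_phase(document):
    """Map: emit a (word, 1) pair for every token in one document."""
    return [(word, 1) for word in TOKEN_RE.findall(document.lower())]


def shuffle_phase(pairs):
    """Shuffle: sort pairs by word so identical words become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Reduce: sum the counts within each run of identical words."""
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


def top_words(documents, n=20):
    """Run map, shuffle, and reduce, then return the n most frequent words."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    counts = reduce_phase(shuffle_phase(pairs))
    # Highest count first; ties are broken alphabetically.
    return sorted(counts, key=lambda wc: (-wc[1], wc[0]))[:n]


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the corpus.
    sample_emails = [
        "Dear friend, I need your urgent assistance.",
        "Urgent: transfer the money to your bank account, dear friend.",
    ]
    for word, count in top_words(sample_emails, n=5):
        print(word, count)
```

Sorting before reducing stands in for the shuffle step of a distributed MapReduce job, where all pairs that share a key are routed to the same reducer. Here everything runs in a single process, so the lists of tuples play the role of the intermediate key-value streams.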

Tools Used

Python
Jupyter Notebook

Visualizations

Bar chart of the 20 most frequent words.
Word cloud of the most frequent words.

View the Code

Click the link below to view the full code and documentation for this project on GitHub:

View on GitHub